Abstract


As the demand for higher performance computers for the processing of remote sensing science algorithms
increases, the need to investigate new computing paradigms is justified. Field Programmable Gate Arrays
(FPGAs) enable the implementation of algorithms at the hardware gate level, leading to orders of magnitude
performance increase over microprocessor based systems. The automatic classification of spaceborne
multispectral images is an example of a computation intensive application that can benefit from
implementation on an FPGA-based custom computing machine (adaptive or reconfigurable computer). A
probabilistic neural network is used here to classify pixels of a multispectral LANDSAT-2 image. The
implementation described utilizes Java client/server application programs to access the adaptive computer
from a remote site. Results verify that a remote hardware version of the algorithm (implemented on an
adaptive computer) is significantly faster than a local software version of the same algorithm (implemented
on a typical general-purpose computer).


I. INTRODUCTION 

A new generation of satellites is being developed by the National Aeronautics and Space Administration
(NASA) to compose the Earth Observing System (EOS). The instruments aboard the EOS satellites not only
extend the observation life of the current satellites, but they also extend the capabilities of remote
sensing scientists to better understand the Earth's environment. Along with the scientific advancements
of the new missions, it is also necessary to explore new technologies that facilitate and reduce the cost
of the data analysis process. In order to process the high volume of data generated by the new EOS
satellites, NASA is constructing the Distributed Active Archive Centers (DAACs), an extensive and powerful
parallel computing environment. Scientists will be able to request certain data products from these centers
for further analysis on their own computing systems. A new technology that could bring increased processing
power to the scientist's desk, offering more complex analysis and interpretation of remotely sensed
scientific data, is highly desirable. The ultimate scenario would be for the scientist to request the data
directly from the satellite along with historic data from an archive center.

Field Programmable Gate Array (FPGA)-based computing, also known as "adaptive" or "reconfigurable
computing", has emerged as a viable computing option in computationally intensive applications. These
computing systems combine the flexibility of general purpose processors with the speed of application
specific processors. By mapping hardware to FPGAs, the computer designer can optimize the hardware for a
specific application, resulting in acceleration rates of several orders of magnitude over general purpose
computers. Because the FPGAs are personalized using SRAM-based memory cells or a fuse programming
technology, they can be reconfigured by the designer for other applications.

Several reconfigurable computers have been implemented to demonstrate the viability of reconfigurable
processors [1], [2], [3], [4]. Applications mapped to these processors include pattern recognition
in high-energy physics [5], applications in statistical physics [6], and genetic optimization algorithms
[7], [8]. In many cases [9], [10], [11], the reconfigurable computing implementation provided the highest
performance in terms of execution speed. The advent of reconfigurable processors, along with novel
methods for mapping applications onto adaptive or reconfigurable processors, enables a new computing
paradigm that may represent the future for remote sensing scientific data processing. In fact,
many applications utilizing FPGA based computers have been developed showing orders of magnitude
acceleration over microprocessor based systems [12], [13], [14]. Moreover, microprocessors and FPGAs
share the same underlying technology: the silicon fabrication process. Therefore, it is reasonable to
conclude that FPGA based machines can usually outperform microprocessor based systems by orders
of magnitude [15], [16], [17].

To achieve such performance, the application must effectively utilize the available resources. This
presents a challenge for software designers, who are generally accustomed to mapping applications
onto fixed computing systems. Generally, the designers examine the available hardware resources, then
modify their application accordingly. With reconfigurable computers, the available resources can be
generated as needed. While it may seem that this flexibility would ease the mapping process, it actually
introduces new problems, such as which components should be used, and how many of each component
should be used to generate the best performance. With conventional hardware components, these
questions are less of an issue. In addition, software engineers are generally not adept at hardware design.
Thus, several research groups have developed methods for mapping applications to reconfigurable
processors [2], [18], [19], [20], [21].

The Adaptive Scientific Data Processing (ASDP) group at NASA's Goddard Space Flight Center
(GSFC), in conjunction with researchers at North Carolina State University, has been investigating
the utilization of FPGA-based computing in the processing of remote sensing scientific algorithms.
The first prototype developed by the group utilized a commercial-off-the-shelf (COTS) reconfigurable
accelerator in the implementation of an automatic classifier for LANDSAT-2 multispectral images
[22]. The implementation discussed in this paper is an extension of the original prototype that allows
users to classify the images on the accelerator from a remote site. Results indicate that a remote
implementation of the classifier in adaptive computing hardware is faster than a software implementation
that executes on a local high-end workstation.

This paper presents details of the FPGA design and is organized as follows. Section II describes
the classifier algorithm that utilizes a probabilistic neural network (PNN). The implementation of the
FPGA custom computing machine is then presented. Finally, a performance analysis of local and
remote versions of the algorithm is presented.
II. The PNN multispectral image classifier


Remote sensing satellites utilize multispectral scanners to collect information about the Earth's
environment [23]. The data collected by such instruments are a set of images, each corresponding to
one spectral band. A multispectral image pixel is represented by a vector of size equal to the number
of bands. The combination of the multiple spectrum measurements represented by each element
of the pixel vector determines a signature that corresponds to a physical object being viewed by the
satellite. Through the observation of a multispectral image and the comparison of pixel vectors to those
obtained from known locations (in-situ measurements), a scientist is able to identify unique signatures
of physical objects and compose classes. These classes contain multispectral pixel representations of
physical objects on the Earth that are closely related. Example classes include forest, tundra, wetland,
and water.

Several neural network schemes have been devised for the automatic classification of multispectral
images [24]. One in particular, the Probabilistic Neural Network (PNN) classifier [25], exhibits
acceptable accuracy, very small training time, robustness to weight changes, and negligible retraining
time. A description of the derivation of the PNN classifier and details of the network implementation,
including rate of false alarms, neural network size, etc., are presented in Chettri et al. [25]. The
Blackhills (South Dakota, USA) data set was generated by the Landsat 2 multispectral scanner (MSS).
The image's four spectral bands (0.5-0.6 µm, 0.6-0.7 µm, 0.7-0.8 µm, and 0.8-1.1 µm) correspond to
channels 4 through 7 of the Landsat MSS sensor. There are 262,144 pixels corresponding to a 512x512
pixel image size, and each pixel represents a 76 m x 76 m ground area; the images were obtained in
1973. The ground truth was provided by the United States Geological Survey.

Figure 1 illustrates the PNN classifier procedure. Each multispectral pixel, represented by a vector,
is compared to a set of pixels belonging to a class. A probability value is calculated for each class.
The highest value indicates the class into which the pixel fits. Eq. 1 is used to derive a value that
indicates the probability that the pixel fits into class $S_k$.

$$f(X \mid S_k) = K1[k] \sum_{i=1}^{P_k} \exp\!\left(-K2[k]\,(X - W_{ki})^{T}(X - W_{ki})\right) \tag{1}$$

where $X$ is a pixel vector, $W_{ki}$ is weight $i$ of class $k$, $d$ is the number of bands, $k$ is the number
of classes, $P_k$ is the number of weights per class, and $K1[k]$, $K2[k]$ are constants.
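The classification rule of Eq. 1 can be illustrated with a minimal floating point sketch. This is not the authors' C or Java code; the function names, the toy weight vectors, and the constant values below are assumptions chosen only to make the example self-contained.

```python
import math

def pnn_scores(x, class_weights, k1, k2):
    """Evaluate Eq. 1 for every class and return the list of scores."""
    scores = []
    for k, weights in enumerate(class_weights):
        total = 0.0
        for w in weights:
            # Squared Euclidean distance between the pixel and one weight vector.
            dist2 = sum((xj - wj) ** 2 for xj, wj in zip(x, w))
            total += math.exp(-k2[k] * dist2)
        scores.append(k1[k] * total)
    return scores

def pnn_classify(x, class_weights, k1, k2):
    """Return the index of the class with the highest Eq. 1 score."""
    scores = pnn_scores(x, class_weights, k1, k2)
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical toy data: two classes, two 4-band weight vectors each.
CLASS_WEIGHTS = [
    [[10, 10, 10, 10], [11, 12, 10, 9]],
    [[200, 200, 200, 200], [198, 201, 202, 200]],
]
K1 = [1.0, 1.0]      # assumed equal for both classes
K2 = [0.125, 0.125]  # assumed value for both classes

print(pnn_classify([12, 11, 9, 10], CLASS_WEIGHTS, K1, K2))
```

The inner loop over weights is the part that dominates the run time, since it is repeated for every class and every pixel.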
III. The FPGA implementation


The first step in implementing an application on an adaptive computer is to select the FPGA-based
custom coprocessor architecture that best matches the algorithm in question. At the current state of
the technology, certain FPGA architectures provide better performance than others for a particular
class of applications. A preliminary analysis of the PNN classifier indicated that the FPGA architecture
[26], shown in Figure 2, matched well with the algorithm. The selected FPGA architecture is composed
of a PCI bus based motherboard and up to 16 plug-in modules. These plug-in modules each contain
two Xilinx 4013E FPGA devices (XFPGA and YFPGA) and provide an equivalent of 13,000 gates
per FPGA, or 26,000 gates per module. The design implementation required approximately 1160
CLBs (85% utilization) per FPGA. Since the module contained two FPGAs and two separate memory
modules (connected via the HBUS), we can perform two lookup table (LUT) operations simultaneously.

A. Algorithm partitioning

The computation intensive portion of the multispectral image classification algorithm found in Eq. 1
was identified by profiling an implementation of the algorithm that was written using the C programming
language. This computation was selected to be executed on the FPGA coprocessor to improve
performance for the complete classification algorithm. The graphical user interface, data storage,
adaptive coprocessor initialization code, algorithm synchronization, and data I/O are performed by the
host processor. The compute intensive PNN classification algorithm equations were mapped onto a
single module.

Figure 3 illustrates the algorithm partitioning. The host processor displays the image during
classification. The host then sends a pixel vector to the FPGA coprocessor. Classification is performed on
the coprocessor and results are returned to the host to be displayed. The host also computes the total
time required to process a complete image. If we wish to use multiple modules as coprocessors, the
host schedules a pixel vector to be processed on each module in a round-robin fashion, then gathers
the results as they become available.
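The round-robin scheduling described above can be sketched as follows. The function names are hypothetical, and each module is modelled as a plain Python callable standing in for an FPGA coprocessor; the real host talks to the hardware through vendor drivers that are not shown here.

```python
def schedule_round_robin(num_pixels, num_modules):
    """Map each pixel index to the module that will classify it."""
    return [pixel % num_modules for pixel in range(num_pixels)]

def classify_image(pixels, modules):
    """Dispatch pixels to modules in round-robin order and gather the results.

    `modules` is a list of callables, each a stand-in for one coprocessor.
    Results are returned in the original pixel order.
    """
    assignment = schedule_round_robin(len(pixels), len(modules))
    return [modules[m](pixel) for pixel, m in zip(pixels, assignment)]

# Two stand-in "modules" that tag their input so the dispatch order is visible.
demo = classify_image([1, 2, 3], [lambda p: ("module0", p), lambda p: ("module1", p)])
print(demo)
```

This sequential sketch only shows the assignment policy; in the hardware setting the modules would work concurrently and the host would collect results as they arrive.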

B. FPGA application design

Due to the limited number of gates available on a single FPGA, it was not feasible to use floating
point arithmetic in our implementation of the PNN algorithm. We therefore transformed the algorithm
to use fixed point arithmetic prior to hardware implementation. The width of the fixed point datapath
was determined by simulating variable bit operations in C and comparing the results to those obtained from
the original algorithm in floating point. Once the fixed point classification of the Blackhills data
set yielded exactly the same results as the floating point version, the datapath width for the FPGA
implementation was known. (Since the output of the PNN classifier is simply a 4-bit value representing
the class that matches the pixel, the fixed point version produced exactly the same result as the floating
point version. Hence there is no loss in precision due to implementation using fixed point arithmetic.)
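The width selection procedure can be illustrated with a small sketch: quantize the intermediate values to a given number of fractional bits, classify, and widen the datapath until the labels agree with the floating point reference. The original simulation was written in C and is not reproduced in the paper; the quantization points, the truncation rule, and the toy data here are all assumptions.

```python
import math

def classify(x, class_weights, k2, frac_bits=None):
    """Return the winning class; truncate to `frac_bits` fractional bits if given."""
    def quantize(value):
        if frac_bits is None:
            return value  # floating point reference
        scale = 1 << frac_bits
        return math.floor(value * scale) / scale

    scores = []
    for k, weights in enumerate(class_weights):
        total = 0.0
        for w in weights:
            dist2 = sum((xj - wj) ** 2 for xj, wj in zip(x, w))
            total += quantize(math.exp(-quantize(k2[k]) * dist2))
        scores.append(total)
    return max(range(len(scores)), key=lambda k: scores[k])

def smallest_matching_width(pixels, class_weights, k2, max_bits=32):
    """Smallest number of fractional bits whose labels match floating point."""
    reference = [classify(p, class_weights, k2) for p in pixels]
    for bits in range(1, max_bits + 1):
        if [classify(p, class_weights, k2, bits) for p in pixels] == reference:
            return bits
    return None

# Hypothetical toy data, not taken from the Blackhills data set.
WEIGHTS = [
    [[10, 10, 10, 10], [11, 12, 10, 9]],
    [[200, 200, 200, 200], [198, 201, 202, 200]],
]
PIXELS = [[12, 11, 9, 10], [201, 199, 200, 202]]
print(smallest_matching_width(PIXELS, WEIGHTS, [0.125, 0.125]))
```

With too few fractional bits the constant 0.125 truncates to zero, every exponential becomes 1, and the classes can no longer be told apart, which is why the search has to continue to a wider format.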

Figure 4 shows the data flow diagram for the hardware implementation of the PNN classifier. A
portion of the design was mapped onto the XFPGA and the remaining blocks were implemented on the
YFPGA of the module. The number of bands ($d$) was fixed to 4, the maximum value of the number
of weights per class ($P_k$) was fixed to 512, and the maximum number of classes ($k$) was set to 16. As
shown in Eq. 1, there are two constants, K1 and K2, that are class dependent. These constants are
pre-calculated on the host and downloaded to memory banks residing on the modules.

The weight memory was mapped to the SRAM that is connected to the YFPGA on the module. The
weight memory can be as large as 16*512*4*2 bytes = 32768 16-bit words. Each weight value occupies
10 bits. Since each class can have up to 512 weights, an array that holds the number of weights for
each class is employed. The inputs of the array are also visible from the host processor.
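The memory size quoted above follows directly from the three limits of the design. The sketch below checks that arithmetic and shows one possible way to turn a (class, weight, band) triple into a word address; the address layout is purely an assumption for illustration, since the paper does not describe how weights are ordered in the SRAM.

```python
MAX_CLASSES = 16
MAX_WEIGHTS_PER_CLASS = 512
BANDS = 4
BYTES_PER_WORD = 2  # each 10-bit weight value occupies one 16-bit word

WEIGHT_MEMORY_WORDS = MAX_CLASSES * MAX_WEIGHTS_PER_CLASS * BANDS
WEIGHT_MEMORY_BYTES = WEIGHT_MEMORY_WORDS * BYTES_PER_WORD

def weight_address(k, i, band):
    """Hypothetical word address of band `band` of weight `i` in class `k`."""
    if not (0 <= k < MAX_CLASSES and 0 <= i < MAX_WEIGHTS_PER_CLASS and 0 <= band < BANDS):
        raise ValueError("index out of range")
    return (k * MAX_WEIGHTS_PER_CLASS + i) * BANDS + band

print(WEIGHT_MEMORY_WORDS, WEIGHT_MEMORY_BYTES)
```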

A 4-bit register holds the number of classes. This register is initialized by the host before loading the
FPGA coprocessors. Due to the lack of space on the XFPGA, the K1 multiplier and the class comparison
blocks were moved to the host. These calculations amount to $k$ multiplications and comparisons
per pixel classification. Since the number of classes is small, they do not account for a significant
amount of the computation, leading to a small performance penalty. For example, if the number of
classes $k = 16$, the maximum number of weights per class $P_k = 512$, and we are classifying a 512x512
image with $d = 4$ spectral bands, Eq. 1 is calculated 16 times per pixel. The performance penalty amounts to
only 16 multiplications and 16 comparisons per pixel (about 4.2 million of each per 512x512 image) that are
executed on the host rather than on the FPGA. This is a small overhead relative to the more than $512^3$
(about 134 million) multiplications that are computed on the FPGA for this example.

Figure 4 contains a Subtraction Unit that computes Y, a vector of four elements, as W ($w_0, w_1, w_2, w_3$)
minus X ($x_0, x_1, x_2, x_3$), where each element of W and X is 10 bits wide. The result of the subtraction
ranges from -1023 to 1023, requiring 11 bits in two's complement format. The Square Unit multiplies each
11-bit element of the Y vector by itself (i.e. $t_0 = y_0 \times y_0$). The values of the elements of the T
vector range from 0 to 1,046,529, requiring 20 bits.
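The bit widths in this part of the datapath can be re-derived from the input width alone. The sketch below assumes unsigned 10-bit inputs (0 to 1023) and uses hypothetical helper names; it also checks the width of a sum of four squared differences.

```python
def signed_bits(lo, hi):
    """Bits needed to hold every integer in [lo, hi] in two's complement."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits

def unsigned_bits(hi):
    """Bits needed to hold every integer in [0, hi] as an unsigned value."""
    return max(1, hi.bit_length())

MAX_INPUT = (1 << 10) - 1            # largest 10-bit value: 1023
MAX_SQUARE = MAX_INPUT ** 2          # largest squared difference: 1,046,529
MAX_BAND_SUM = 4 * MAX_SQUARE        # sum over the four bands: 4,186,116

print(signed_bits(-MAX_INPUT, MAX_INPUT))  # width of one difference
print(unsigned_bits(MAX_SQUARE))           # width of one squared difference
print(unsigned_bits(MAX_BAND_SUM))         # width of the four-band sum
```

Because the squares are never negative, the 20-bit figure is an unsigned width; storing the same range as a signed value would take one more bit.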

The next computation involves the Band Accumulator Unit. This unit adds the 4 elements of the
T vector together resulting in $u$, ranging in value from 0 to 4,186,116, requiring 22 bits. The K2[k]
Memory holds the K2 values for each class. $K2 = (1/2)\sigma_k^{-2}$, where $\sigma_k = 2, 3, \ldots, 11, 12$. As a
result, K2 varies between 0.125 ($\sigma_k = 2$) and 0.003472 ($\sigma_k = 12$). The largest value of K2 is 0.125 in
decimal and is represented exactly in binary (0.001). In order to increase the precision of the multiplication,
the values of K2 are stored with the binary point shifted to the right by 2 (multiplied by four). After
K2 is multiplied by $u$ in the K2 Multiplier Unit, the binary point of the result of the multiplication is
shifted to the left by 2 (a divide by 4 effect). Since this is a representation issue, no hardware is necessary
to perform the shifts in the YFPGA (refer to Figure 4); only the host needs to maintain the values in
the K2[k] memory in the appropriate format. The K2 Multiplier Unit multiplies the K2 values for
each class by the accumulated values of the difference between a pixel and a weight vector. It delivers
a 44-bit result to the TO_XFPGA unit shown in Figure 4. Bits 0 to 23 represent the fractional portion
(remember that the binary point is shifted to the left by 2), and bits 24 to 43 represent the integer
part of the result.
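The fixed point format of the K2 multiplication can be sketched with plain integers. The 24 fractional bits and the 44-bit product come from the text; treating the stored K2 word as a 22-bit fraction (44 minus the 22 bits of the accumulated sum) and truncating rather than rounding are assumptions of this sketch.

```python
FRACTION_BITS = 24  # bits 0..23 of the product hold the fraction
PRODUCT_BITS = 44

def k2_to_fixed(sigma):
    """Integer whose value, divided by 2**24, approximates K2 = 1 / (2 sigma^2)."""
    k2 = 0.5 / (sigma ** 2)
    return int(k2 * (1 << FRACTION_BITS))  # truncation is an assumption

def k2_multiply(u, k2_fixed):
    """Multiply the 22-bit band sum by the stored K2 word."""
    product = u * k2_fixed
    if product >= (1 << PRODUCT_BITS):
        raise OverflowError("product does not fit in 44 bits")
    return product

def split_product(product):
    """Return (integer part, fraction field) of the 44-bit product."""
    return product >> FRACTION_BITS, product & ((1 << FRACTION_BITS) - 1)

# Largest case: sigma = 2 (K2 = 0.125) and the maximum band sum 4,186,116.
worst = k2_multiply(4186116, k2_to_fixed(2))
print(split_product(worst))
```

Since 0.125 times 4,186,116 is 523,264.5, the integer field should hold 523,264 and the fraction field one half, that is, only its top bit set.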

The next operation is to compute the exponential of the negative of this number. Given the precision
of the following operations, any number above 24 will yield zero as a result. Thus, if any of bits 43 to
29 is set or both bits 28 and 27 are set, the result of the exponential should be zero. Only 28 bits are
passed on to the Exponential LUT Unit, and they are bits 1 to 28. Bit 0 and bits 29 to 43 are discarded. It was
also found that a considerable number of results of the multiplication are zero, which indicates that
the result of the exponential should be one. In order to save processing steps in this case, the output
of the multiplier is tested for zero, and a flag is passed to the Exponential LUT Unit, indicating that
its result should be 1.
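The bit tests described above can be written out as a short sketch operating on the 44-bit product as a Python integer. The function name and the tagged return values are hypothetical; only the bit positions are taken from the text.

```python
def exp_lut_input(product):
    """Classify a 44-bit product before the exponential look-up.

    Returns ("one", None) when the product is zero, ("zero", None) when the
    value is 24 or more, and ("lookup", a) otherwise, where `a` is the 28-bit
    field formed by bits 1 to 28 of the product.
    """
    if product == 0:
        return ("one", None)
    high_bits_set = (product >> 29) != 0                   # any of bits 43..29
    bits_28_and_27_set = ((product >> 27) & 0b11) == 0b11  # 16 + 8 = 24
    if high_bits_set or bits_28_and_27_set:
        return ("zero", None)
    a = (product >> 1) & ((1 << 28) - 1)                   # drop bit 0
    return ("lookup", a)

print(exp_lut_input(24 << 24))             # the value 24.0
print(exp_lut_input(int(23.5 * 2 ** 24)))  # the value 23.5
```

With the integer part starting at bit 24, bit 28 has weight 16 and bit 27 has weight 8, so the two-bit test is exactly a test for values of 24 or more that still fit below bit 29.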

A look-up table is used to determine the value of $e^{-a}$. If we assume that $a = b + c$, then:

$$e^{-a} = e^{-(b+c)} = e^{-b}\,e^{-c} \tag{2}$$

Since $a$ is a 28-bit binary number, the value comprising bits 27 to 14 of $a$ represents $b$, and the value
comprising bits 13 to 0 of $a$ represents $c$. The range of values of $b$ and $e^{-b}$ are:

$$00000.000000000 < b < 10111.111111111, \tag{3}$$

or

$$0 < b < 23.9980469, \tag{4}$$

which results in

$$0.9980519 > e^{-b} > 3.78 \times 10^{-11}. \tag{5}$$

The range of values of $c$ and $e^{-c}$ are:

$$00000.00000000000000000000001 < c < 00000.00000000011111111111111, \tag{6}$$

or

$$1.19 \times 10^{-7} < c < \ldots \times 10^{-3}, \tag{7}$$

which results in

$$0.999999881 > e^{-c} > 0.998109888. \tag{8}$$

The values of $e^{-b}$ and $e^{-c}$ are previously calculated and organized into a look-up table. At run
time, the values of $b$ and $c$ are used to address the look-up table stored in the memory that is directly
connected to the XFPGA. The values of $e^{-b}$ and $e^{-c}$ retrieved from the look-up table are then multiplied
to give the value of $e^{-a}$. The values stored in the look-up table are 32 bits wide. The result of the
multiplication is 64 bits, but only the most significant 42 bits are sent out. As a result,

$$3.77 \times 10^{-11} < e^{-a} < 0.998051781. \tag{9}$$
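The two-table exponential can be emulated in software to see how the split keeps the tables small. In this sketch the 28-bit input is taken to have 23 fractional bits, and the table entries are assumed to be 32-bit fractions rounded to nearest and clamped so that the value 1.0 fits; the paper does not give these details, and the names are hypothetical.

```python
import math

A_FRACTION_BITS = 23       # assumed position of the binary point in a
HALF_BITS = 14             # b = bits 27..14 of a, c = bits 13..0 of a
TABLE_FRACTION_BITS = 32   # table entries are 32 bits wide
OUTPUT_BITS = 42           # most significant bits kept from the 64-bit product
TABLE_MAX = (1 << TABLE_FRACTION_BITS) - 1

def _entry(x):
    """32-bit fixed point approximation of exp(-x), clamped below 1.0."""
    return min(TABLE_MAX, round(math.exp(-x) * (1 << TABLE_FRACTION_BITS)))

# Two tables of 2**14 entries each instead of one table of 2**28 entries.
B_TABLE = [_entry(i / (1 << (A_FRACTION_BITS - HALF_BITS))) for i in range(1 << HALF_BITS)]
C_TABLE = [_entry(i / (1 << A_FRACTION_BITS)) for i in range(1 << HALF_BITS)]

def lut_exp(a):
    """Approximate exp(-a / 2**23) as a 42-bit fraction using e^-b * e^-c."""
    b_index = a >> HALF_BITS
    c_index = a & ((1 << HALF_BITS) - 1)
    product = B_TABLE[b_index] * C_TABLE[c_index]  # up to 64 bits
    return product >> (2 * TABLE_FRACTION_BITS - OUTPUT_BITS)

def to_float(value):
    """Interpret a 42-bit fraction as a floating point number."""
    return value / (1 << OUTPUT_BITS)

a = 12345678
print(to_float(lut_exp(a)), math.exp(-a / 2 ** 23))
```

The design choice illustrated here is the trade of one multiplication for a large reduction in table size: two tables of 16,384 entries replace a single table that would need one entry per 28-bit input.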

The Class Accumulator Unit sums up all the comparisons between a given pixel and all weights of
a given class, and outputs the result when it receives a flag indicating that the data to add to the
accumulator refers to the last weight in a class. The output range of the Exponential Multiplier Unit
is $3.77 \times 10^{-11} < d < 0.998051781$. Thus, the largest accumulated value is 0.998051781 * 512 (the
maximum number of weights) = 511.002511872. In order to keep the precision of $d$, the accumulator is extended
to 40 bits to accommodate the original 41 bits after the binary point and 1 bit before the binary point, and
the new 8 bits before the binary point. Each class has a K1 value associated with it. The value of
K1 is determined by the following formula:

$$K1 = \frac{1}{(2\pi)^{d/2}\,\sigma_k^{d}\,P_k}$$

The result of the multiplication of K1 by the accumulated differences between a pixel and all weights
in a given class is compared with all other classes to determine the largest result, which indicates the
class to which a pixel most probably belongs. In order to keep the values being multiplied in the same
range, allowing us to use fixed point arithmetic, the values of K1 are normalized as follows: given $d$,
$\sigma_k$ and $P_k$, the host program calculates all K1 values and divides them by the largest one. The result
is that one value of K1 equals 1 and all the others are less than 1. The K1 Multiplier Unit multiplies
the 40-bit result of the Class Accumulator Unit by the 42-bit K1 value from the K1 Memory Unit,
and outputs a 40-bit result to the g register in the Class Comparison Unit. The Class Comparison
Unit receives a value that represents the comparison between a pixel and all weights in a class, and
compares this value against the values generated for all other classes. At the end of the calculation
of all classes, it outputs a code that represents the class which presented the largest value, that is,
the closest match.
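The K1 normalization and the final class comparison can be sketched in floating point as follows. The K1 expression used here is the standard PNN normalizing constant, 1 / ((2 pi)^(d/2) sigma^d P), which is this sketch's reading of the formula above; the function names and sample values are hypothetical.

```python
import math

def normalized_k1(sigmas, counts, d=4):
    """Compute K1 for every class and divide by the largest value."""
    raw = [
        1.0 / ((2 * math.pi) ** (d / 2) * sigma ** d * p)
        for sigma, p in zip(sigmas, counts)
    ]
    largest = max(raw)
    return [value / largest for value in raw]

def best_class(class_sums, k1_normalized):
    """Scale each accumulated class sum by its K1 and pick the largest."""
    scores = [k1 * s for k1, s in zip(k1_normalized, class_sums)]
    return max(range(len(scores)), key=lambda k: scores[k])

# Two hypothetical classes with sigma = 2 and sigma = 4, 512 weights each.
k1 = normalized_k1([2, 4], [512, 512])
print(k1)
print(best_class([10.0, 100.0], k1))
```

Dividing by the largest K1 does not change which class wins, because every score is scaled by the same positive factor; it only keeps the operands at or below 1 so that a pure fraction format can be used.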

C. The host software 

The software that was developed for the PNN algorithm that executes on the host processor was
written in the Java programming language. We selected the Java programming language for several
reasons. Java supports software reuse, native methods, and remote method invocation, and it has a built-in
security manager. Software reuse allows Java objects and methods to be used repeatedly in different
applications. Native methods allow legacy code (old software written in another language) to be called
directly from Java methods. The security manager and remote method invocation allow Java programs
to be executed on remote CPUs with the system taking care of network traffic errors, security, etc.
The FPGA system used for development of the hardware modules contains drivers for interfacing to
the FPGA devices that are only available in the C programming language. Java was a useful choice
for a programming language since native methods allow one to call C routines directly from Java. This
is accomplished by building a dynamic link library that contains the C functions that interface to the
FPGA coprocessors. A Java native method is used to call these C functions directly.

The application was implemented using a client/server methodology to provide an interface to the
FPGA coprocessors from a remote site. The server program interfaces directly to the reconfigurable
accelerator via the C drivers. It receives a block of pixels from the client, initiates the classification
of each of the pixels on the FPGA accelerator, gathers the results into a block of classified data,
and sends the results back to the client. The client software controls the user interface, image data
input/output and translation, in addition to communication with the server. By selecting Java as
a programming language and separating the program into client and server subsystems, the client
software is completely independent of the operating system that will execute the client program. Only
the server contains code that depends on both the operating system used and the specific reconfigurable
accelerator that has been selected. Hence, in this paper we present results
obtained from an implementation of the PNN algorithm that can be executed from a remote machine
accessible, for example, on the Internet.
IV. Experimental Results


In our experiments, we used the remote implementation of the PNN classifier to measure the
effectiveness of a client/server approach to adaptive computing. Figure 5 illustrates a potential scenario
for remote image classification. In this configuration, the server program has a direct interface to the
FPGA coprocessor. It initializes the FPGA board and loads the architecture shown in Figure 4 into
the programmable hardware. In our project, the server executes on a workstation at NASA. The client
program communicates with the server via the Internet. The client requests a connection with the
server and, once granted, sends data to the server for processing. The server processes the data and
sends the results back to the client for display. While the client is designed to execute at a remote site,
e.g. NCSU, in our experiments both the client and server programs were executed on a single host at
NASA.

Two software implementations of the PNN algorithm were developed to compare the relative
performance of implementations in two different programming languages. One version was written entirely
in the Java programming language. The other version was written using the C programming language.
The main routine in the client spawned either the Java or C version of the algorithm via a call from
a normal or native method, respectively.

Two FPGA-based hardware versions of the PNN algorithm were implemented using single or multiple
modules. We report results using two modules as we only had two modules available for our
experiments. In the single module case, one pixel or one block of pixels was sent to the FPGA
coprocessor and the results were returned to the client via the server. In the two module experiments,
one pixel or one block was sent to each of the two FPGA coprocessors in an attempt to speed up
algorithm execution by a factor of 2. Each module in the multiple module case contained a complete
implementation of the hardware in Figure 4.

A traditional version of the PNN Classifier algorithm was previously developed as the basis for
the remote version presented in this paper. This experiment demonstrates the potential merits of a
remote image classification algorithm implementation. The traditional version executed on a 100 MHz
Pentium PC. This implementation, written entirely in C, required 2044 CPU seconds to classify the
complete Blackhills data set. By augmenting the PC with a single module running the PNN classifier
at 16 MHz, the processing time was reduced to 220 CPU seconds. In this case, the adaptive computing
implementation is 9.29 times faster than the software version. Adding one additional module improved
execution time to 90 CPU seconds.

In our experiments with the remote PNN classifier, we ran a total of 4 different scenarios, presented
in Figure 6. The scenarios allow us to compare local and remote versions of the algorithm that execute
on the client and server with pixel-based or block-based algorithms, where one pixel or one block of
pixels is processed. In each experiment, we present execution times for two software implementations
(written in Java and C) and two hardware implementations (one module or two modules).

In Table I, we present results of a remote implementation of the image classification algorithm where
one pixel is processed at a time. Note that the implementation of the algorithm in Java requires
7598 CPU seconds to complete. The C version of the algorithm requires slightly more time since it is
actually spawned from the local client Java program to execute on the remote server workstation. (The
overhead associated with calling a C function from Java is included in the execution time.) For all
practical purposes, the C and Java versions of the algorithm require approximately the same execution
time. This was a surprising result since Java is an interpreted language; however, we noticed a drastic
improvement in the execution of Java programs using more recent versions of the Java interpreter.
The remote version of the algorithm executing on a single FPGA module was approximately 8.0 times faster
than the remote software version. Also note that the addition of one module in the multiple module case
does not impact performance.

The next experiment involved sending a block of data from the client to the server for processing. The
results of this experiment are also shown in Table I. In our experiments, an arbitrary block size (equal
to 6 rows) was selected. (Future experiments will identify the optimal block size.) Since there are 512
pixel vectors in a row, and 4 pixels per vector, one block contains 12,288 pixels. Note that the execution
time of the remote Java version of the block-based algorithm is significantly smaller than that of the
pixel-based algorithm. The execution time was reduced from 7598 to 1858 CPU seconds. Once again, the single
module implementation was significantly (7.6 times) faster than the remote software version written in
Java. The addition of a second module did not provide a speedup due to the overhead associated with
sending a block of data to the server. Please note that the FPGA coprocessor consistently processes
a pixel at a time; however, the server will wait for all pixels in a block to be processed before sending
the results back to the client.
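The block arithmetic above can be checked with a short sketch that splits a 512-row image into blocks of 6 rows. The helper names are hypothetical, and the way the final partial block is handled is an assumption, since the paper does not say how the last 2 rows were sent.

```python
def row_blocks(num_rows, rows_per_block):
    """Return (first_row, end_row) pairs, end exclusive, covering the image."""
    return [
        (start, min(start + rows_per_block, num_rows))
        for start in range(0, num_rows, rows_per_block)
    ]

def values_per_block(rows, vectors_per_row=512, bands=4):
    """Number of band values carried by a block of `rows` image rows."""
    return rows * vectors_per_row * bands

blocks = row_blocks(512, 6)
print(len(blocks), blocks[-1], values_per_block(6))
```

Since 512 is not a multiple of 6, this split yields 85 full blocks plus one short block of 2 rows, so block-based transfer replaces 262,144 per-pixel round trips with 86 per-block round trips.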

Table II presents results of PNN classification executing on a local workstation. The client program
can initiate execution of either the software or the hardware algorithm implementations. In the local
pixel-based algorithm, the Java version requires about 1817 CPU seconds and the single module
implementation requires 141 CPU seconds. This is approximately an order of magnitude improvement
in execution time. The multiple module version completes in 77 seconds, resulting in a 2:1 speedup
over the single module as expected. The results from Table II illustrate that block-based processing is
counterproductive on a local client workstation.


V. Conclusions 

In this paper, it was shown that the implementation of a multispectral image classifier on an adaptive
computer yields an order of magnitude performance increase over high end workstations. If we extract
the fastest execution times for the algorithm from the Tables presented, we find an interesting result
that relates to the potential impact of remote adaptive computing technology. The fastest remote
hardware implementation of the PNN algorithm consisted of a single module requiring 178 CPU
seconds to complete. On the other hand, the fastest local software version of the algorithm was the
Java version, which required approximately 1308 CPU seconds. This is 7.35 times slower than the remote
hardware implementation. Hence, for image classification, a remote hardware implementation of the algorithm
is faster than a local software implementation of the algorithm. Future work is to identify additional
applications wherein a remote hardware implementation is consistently faster than a local software
version. Additionally, experiments that quantify the effects of a heavily loaded network connection
should be conducted.